In this report, we will analyse data about LEGO. All data comes from
https://rebrickable.com/downloads/, which is a site that
allows users to share their MOCs (My Own Creation - constructions
created by LEGO fans).
We’ll start by looking at the data schema, the tables and the data
they contain. We will see how the tables interconnect and understand
what the values of the different attributes are. We will also learn
about the size of the entire dataset. The next step will be an analysis
on the data from individual or linked tables. We will start with
minifigures, then move on to parts and finish the analysis on sets.
During the analyses, we will wrestle with missing data. Last but not
least we will try to predict the future using forecasting. There we will
focus on all sets and a Star Wars set.
After the analysis, the most striking observation is that LEGO is
growing all the time. Every year there are more sets and the number of
parts increases. Some interesting insights can be reached by observing
the data on themes. Many of them are well-known brands (Ben 10, Avatar,
Angry Bird) that appear at the peak of the brand hype. In addition to
this, there are many generic themes (Town, City) that have been with
LEGO for a long time and are becoming more popular every year. It’s also
worth to note that at the beginning LEGO didn’t cooperate with many
external brands which is why we don’t see many sets of external brands
at the beginning of the company’s operation.
Most of the plots are interactive so keep in mind that if you want to
find out more about presented data just hover over it.
Libraries
library(ggplot2)
library(dplyr)
library(png)
library(grid)
library(plotly)
library(gganimate)
library(tidyr)
library(forecast)
Basic tables analysis
Data schema
We can clearly see that data is divided into 3 parts. Going from left
to right, the first concerns parts and their color, second concerns
minifigures and the last one is about sets. All parts have table with
prefix inventory (for example: inventory_sets) which contains stock
data. It is possible to connect data through inventories table.

Tables
Inventories
Table linking parts, minifigures and sets.
-
id: inventory id
-
version: inventory version
-
set_num: set id
| 1 |
1 |
7922-1 |
| 3 |
1 |
3931-1 |
| 4 |
1 |
6942-1 |
| 15 |
1 |
5158-1 |
| 16 |
1 |
903-1 |
| 17 |
1 |
850950-1 |
|
Min. : 1 |
Min. : 1.000 |
Length:37265 |
|
1st Qu.: 14424 |
1st Qu.: 1.000 |
Class :character |
|
Median : 54379 |
Median : 1.000 |
Mode :character |
|
Mean : 61104 |
Mean : 1.091 |
NA |
|
3rd Qu.: 88842 |
3rd Qu.: 1.000 |
NA |
|
Max. :194312 |
Max. :16.000 |
NA |
Inventory sets
Sets stock
-
inventory_id: inventory id
-
set_num: set id
-
quantity: number of sets
| 35 |
75911-1 |
1 |
| 35 |
75912-1 |
1 |
| 39 |
75048-1 |
1 |
| 39 |
75053-1 |
1 |
| 50 |
4515-1 |
1 |
| 50 |
4520-1 |
2 |
|
Min. : 1 |
Min. : 1.000 |
Length:37265 |
|
1st Qu.: 14424 |
1st Qu.: 1.000 |
Class :character |
|
Median : 54379 |
Median : 1.000 |
Mode :character |
|
Mean : 61104 |
Mean : 1.091 |
NA |
|
3rd Qu.: 88842 |
3rd Qu.: 1.000 |
NA |
|
Max. :194312 |
Max. :16.000 |
NA |
Sets
Data related to sets
-
set_num: set id
-
name: set name
-
year: when set was released
-
theme_id: theme id
-
num_parts: number of parts in set
|
Length:21880 |
Length:21880 |
Min. :1949 |
Min. : 1 |
Min. : 0.0 |
Length:21880 |
|
Class :character |
Class :character |
1st Qu.:2001 |
1st Qu.:273 |
1st Qu.: 3.0 |
Class :character |
|
Mode :character |
Mode :character |
Median :2012 |
Median :497 |
Median : 31.0 |
Mode :character |
|
NA |
NA |
Mean :2008 |
Mean :442 |
Mean : 161.4 |
NA |
|
NA |
NA |
3rd Qu.:2018 |
3rd Qu.:608 |
3rd Qu.: 139.0 |
NA |
|
NA |
NA |
Max. :2024 |
Max. :752 |
Max. :11695.0 |
NA |
Themes
Data related to themes
-
id: theme id
-
name: theme name
-
parent_id: theme id of a parent theme (self-referential
foreign key)
| 1 |
Technic |
NA |
| 3 |
Competition |
1 |
| 4 |
Expert Builder |
1 |
| 16 |
RoboRiders |
1 |
| 17 |
Speed Slammers |
1 |
| 18 |
Star Wars |
1 |
|
Min. : 1.0 |
Length:468 |
Min. : 1.0 |
|
1st Qu.:250.5 |
Class :character |
1st Qu.:186.0 |
|
Median :466.0 |
Mode :character |
Median :411.0 |
|
Mean :433.5 |
NA |
Mean :360.6 |
|
3rd Qu.:625.2 |
NA |
3rd Qu.:512.5 |
|
Max. :752.0 |
NA |
Max. :697.0 |
|
NA |
NA |
NA’s :145 |
Inventory Minifigs
Minifigures stock
-
inventory_id: inventory id
-
fig_num: minifigures id
-
quantity: number of minifigures
| 3 |
fig-001549 |
1 |
| 4 |
fig-000764 |
1 |
| 19 |
fig-000555 |
1 |
| 25 |
fig-000574 |
1 |
| 26 |
fig-000842 |
1 |
| 26 |
fig-008641 |
1 |
|
Min. : 3 |
Length:20858 |
Min. : 1.000 |
|
1st Qu.: 7869 |
Class :character |
1st Qu.: 1.000 |
|
Median : 15681 |
Mode :character |
Median : 1.000 |
|
Mean : 43010 |
NA |
Mean : 1.062 |
|
3rd Qu.: 66834 |
NA |
3rd Qu.: 1.000 |
|
Max. :194312 |
NA |
Max. :100.000 |
Minifigs
Data related to minifigures
-
fig_num: minifigures id
-
name: minifigure name
-
num_parts: number of parts in minifigure
|
Length:13764 |
Length:13764 |
Min. : 0.000 |
Length:13764 |
|
Class :character |
Class :character |
1st Qu.: 4.000 |
Class :character |
|
Mode :character |
Mode :character |
Median : 4.000 |
Mode :character |
|
NA |
NA |
Mean : 5.296 |
NA |
|
NA |
NA |
3rd Qu.: 5.000 |
NA |
|
NA |
NA |
Max. :156.000 |
NA |
Inventory parts
Parts stock
-
inventory_id: inventory id
-
part_num: part id
-
color_id: color id
-
quantity: number of parts
-
is_spare: is part a spare part
|
Min. : 1 |
Length:1180987 |
Min. : -1.0 |
Min. : 1.00 |
Length:1180987 |
Length:1180987 |
|
1st Qu.: 9404 |
Class :character |
1st Qu.: 4.0 |
1st Qu.: 1.00 |
Class :character |
Class :character |
|
Median : 22838 |
Mode :character |
Median : 15.0 |
Median : 2.00 |
Mode :character |
Mode :character |
|
Mean : 50849 |
NA |
Mean : 131.8 |
Mean : 3.37 |
NA |
NA |
|
3rd Qu.: 87088 |
NA |
3rd Qu.: 71.0 |
3rd Qu.: 4.00 |
NA |
NA |
|
Max. :194312 |
NA |
Max. :9999.0 |
Max. :3064.00 |
NA |
NA |
Colors
Data related to elements color
-
id: color id
-
name: color name
-
rgb: color in rgb format
-
is_trans: is color a transparent color
| -1 |
[Unknown] |
0033B2 |
f |
| 0 |
Black |
05131D |
f |
| 1 |
Blue |
0055BF |
f |
| 2 |
Green |
237841 |
f |
| 3 |
Dark Turquoise |
008F9B |
f |
| 4 |
Red |
C91A09 |
f |
|
Min. : -1.0 |
Length:263 |
Length:263 |
Length:263 |
|
1st Qu.: 83.0 |
Class :character |
Class :character |
Class :character |
|
Median :1005.0 |
Mode :character |
Mode :character |
Mode :character |
|
Mean : 651.4 |
NA |
NA |
NA |
|
3rd Qu.:1070.5 |
NA |
NA |
NA |
|
Max. :9999.0 |
NA |
NA |
NA |
Parts
Data related to parts
-
part_num: part id
-
name: part name
-
part_cat_id: part category id
| 003381 |
Sticker Sheet for Set 663-1 |
58 |
Plastic |
| 003383 |
Sticker Sheet for Sets 618-1, 628-2 |
58 |
Plastic |
| 003402 |
Sticker Sheet for Sets 310-3, 311-1, 312-3 |
58 |
Plastic |
| 003429 |
Sticker Sheet for Set 1550-1 |
58 |
Plastic |
| 003432 |
Sticker Sheet for Sets 357-1, 355-1, 940-1 |
58 |
Plastic |
| 003434 |
Sticker Sheet for Set 575-2, 653-1, 460-1 |
58 |
Plastic |
|
Length:52615 |
Length:52615 |
Min. : 1.00 |
Length:52615 |
|
Class :character |
Class :character |
1st Qu.:17.00 |
Class :character |
|
Mode :character |
Mode :character |
Median :41.00 |
Mode :character |
|
NA |
NA |
Mean :38.91 |
NA |
|
NA |
NA |
3rd Qu.:60.00 |
NA |
|
NA |
NA |
Max. :68.00 |
NA |
Elements
Data related to part elements
-
element_id: element id
-
part_num: part id
-
color_id: color id
| 6443403 |
2277c01pr0009 |
1 |
2277 |
| 6300211 |
67906c01 |
14 |
67908 |
| 4566309 |
2564 |
0 |
2564 |
| 4275423 |
53657 |
1004 |
53657 |
| 6194308 |
92926 |
71 |
28967 |
| 6229123 |
26561 |
4 |
26561 |
|
Min. : 9327 |
Length:84138 |
Min. : -1.0 |
Min. : 1001 |
|
1st Qu.: 4259774 |
Class :character |
1st Qu.: 8.0 |
1st Qu.: 18454 |
|
Median : 6057754 |
Mode :character |
Median : 28.0 |
Median : 41748 |
|
Mean : 5222065 |
NA |
Mean : 539.7 |
Mean : 45570 |
|
3rd Qu.: 6262024 |
NA |
3rd Qu.: 135.0 |
3rd Qu.: 75475 |
|
Max. :61532443 |
NA |
Max. :9999.0 |
Max. :107520 |
|
NA |
NA |
NA |
NA’s :23682 |
Part Relationships
Data related to parts relationship
-
rel_type: type of relationship (rel_types are: (P)rint,
Pai(R), Su(B)-Part, (M)old, Pa(T)tern, (A)lternate)
-
child_part_num: id of child part (foreign key to parts)
-
parent_part_num: id of parent part (foreign key to
parts)
| P |
3626cpr3662 |
3626c |
| P |
87079pr9974 |
87079 |
| P |
3960pr9971 |
3960 |
| R |
98653pr0003 |
98086pr0003 |
| R |
98653pr0003 |
98088pat0003 |
| R |
98653pr0003 |
98089pat0003 |
|
Length:29977 |
Length:29977 |
Length:29977 |
|
Class :character |
Class :character |
Class :character |
|
Mode :character |
Mode :character |
Mode :character |
Part categories
Data related to parts category
-
id: category id
-
name: category name
| 1 |
Baseplates |
| 3 |
Bricks Sloped |
| 4 |
Duplo, Quatro and Primo |
| 5 |
Bricks Special |
| 6 |
Bricks Wedged |
| 7 |
Containers |
|
Min. : 1.00 |
Length:66 |
|
1st Qu.:19.25 |
Class :character |
|
Median :35.50 |
Mode :character |
|
Mean :35.36 |
NA |
|
3rd Qu.:51.75 |
NA |
|
Max. :68.00 |
NA |
Dataframes size
-
All rows in dataframes: 1446639
-
All columns in dataframes 45
-
All values in dataframes: 8099232
Analyses
Parts
Sum of rows with NA value: 0
Material Type of Lego Parts
Log10(Number of Parts) is used instead of just
displaying Number of Parts because difference between
Plastic and other materials is so big that they would
be almost invisible on plot.
Part categories
Sum of rows with NA value: 0
Parts with categories
Sum of rows with NA value: 0
Part of the table with parts and categories
| 003381 |
58 |
Plastic |
Stickers |
| 003383 |
58 |
Plastic |
Stickers |
| 003402 |
58 |
Plastic |
Stickers |
| 003429 |
58 |
Plastic |
Stickers |
| 003432 |
58 |
Plastic |
Stickers |
| 003434 |
58 |
Plastic |
Stickers |
We can see that there are 8 main categories (Tiles Round and Curved -
Tubes and Hoses)
Elements
Sum of rows with NA value: 23682
All NA values are in design_id column
In the data schema there is no indication of what the design_id is or
where it is located, so we will remove the column from the dataframe
Colors
Sum of rows with NA value: 0
Parts with Colors
| 003381 |
58 |
Plastic |
Stickers |
NA |
NA |
NA |
NA |
NA |
| 003383 |
58 |
Plastic |
Stickers |
NA |
NA |
NA |
NA |
NA |
| 003402 |
58 |
Plastic |
Stickers |
NA |
NA |
NA |
NA |
NA |
| 003429 |
58 |
Plastic |
Stickers |
NA |
NA |
NA |
NA |
NA |
| 003432 |
58 |
Plastic |
Stickers |
NA |
NA |
NA |
NA |
NA |
| 003434 |
58 |
Plastic |
Stickers |
NA |
NA |
NA |
NA |
NA |
Sum of rows with NA value: 94250
In this case NA values mean that there are some parts that do not
consist of elements
| 0 |
0 |
0 |
0 |
18850 |
18850 |
18850 |
18850 |
18850 |
Parts without color are the parts that do not consist of elements
Themes
Themes sum of rows with NA value
Sum of rows with NA value: 145
All NA values are in parent_id column
Parent themes collect other themes, which are their sub-themes. Part
of parent themes names:
## [1] "Technic" "Creator" "Make & Create" "Town"
## [5] "City" "Racers"
Join parent themes with themes. Insert parent theme for their
children.
| 1 |
Technic |
Technic |
| 3 |
Competition |
Technic |
| 4 |
Expert Builder |
Technic |
| 16 |
RoboRiders |
Technic |
| 17 |
Speed Slammers |
Technic |
| 18 |
Star Wars |
Technic |
Sets
Sum of rows with NA value: 0
In sets dataframe there is data from 2024 and 2023. At the time the
report was created, the data for these years was not yet complete so we
will delete each observation from 2024 and 2023.
Histogram for Number of Parts in sets_df
We will remove some rows from sets_df which has really big Number of
Parts. They diverge significantly from other values, making our graph
unreadable.
Number of rows that we ignore: 19

Correlation between year and num_parts
Pearson correlation between year and Sum of
Parts: 0.8061769.

The most interesting thing in heatmap below is really low correlation
between median and sum_parts/count_sets. This means that altough the
number of parts and sets is increasing the median does not share this
trend.

Sets with Themes
Sum of rows with NA value: 0
Part of the table with sets and themes
We can see that most themes have less than 500 sets and less than
50.000 parts.
Largest stock of sets
|
name
|
sum_quantity
|
img_tag
|
|
Series 19 - Random Bag
|
110
|
|
|
Unikitty! Series 1 - Random Bag
|
60
|
|
|
Series 9 - Random Bag
|
60
|
|
|
Series 10 - Random Bag
|
60
|
|
|
Series 11 - Random Bag
|
60
|
|
|
The LEGO Movie Series 1 - Random Bag
|
60
|
|
|
The Simpsons Series 1 - Random Bag
|
60
|
|
|
Series 12 - Random Bag
|
60
|
|
|
Series 13 - Random Bag
|
60
|
|
|
The Simpsons Series 2 - Random Bag
|
60
|
|
Forecasting
Forecast Number of Sets

Model predicts that in 2025 there will be:
1127 sets and in 2030 there will be:
1201 sets.
Forecast Number of Parts in all Sets

Model predicts that in 2025 there will be:
394662 parts and in 2030 there will
be: 507113 parts.
Forecast Number of Star Wars Sets

Model predicts that in 2025 there will be:
42 Star Wars sets and in 2030 there
will be: 42 Star Wars sets.
Forecast Number of Parts in Star Wars Sets

Model predicts that in 2025 there will be:
24861 Star Wars parts and in 2030
there will be: 28797 Star Wars parts.
Github: https://github.com/C7A7A/lego_analysis